Categorical Proportional Difference: A Feature Selection Method for Text Categorization
نویسندگان
چکیده
Supervised text categorization is a machine learning task where a predefined category label is automatically assigned to a previously unlabelled document based upon characteristics of the words contained in the document. Since the number of unique words in a learning task (i.e., the number of features) can be very large, the efficiency and accuracy of the learning task can be increased by using feature selection methods to extract from a document a subset of the features that are considered most relevant. In this paper, we introduce a new feature selection method called categorical proportional difference (CPD), a measure of the degree to which a word contributes to differentiating a particular category from other categories. The CPD for a word in a particular category in a text corpus is a ratio that considers the number of documents of a category in which the word occurs and the number of documents from other categories in which the word also occurs. We conducted a series of experiments to evaluate CPD when used in conjunction with SVM and Naive Bayes text classifiers on the OHSUMED, 20 Newsgroups, and Reuters-21578 text corpora. Recall, precision, and the F-measure were used as the measures of performance. The results obtained using CPD were compared to those obtained using six common feature selection methods found in the literature: χ, information gain, document frequency, mutual information, odds ratio, and simplified χ. Empirical results showed that, in general, according to the F-measure, CPD outperforms the other feature selection methods in four out of six text categorization tasks.
منابع مشابه
Improving the Operation of Text Categorization Systems with Selecting Proper Features Based on PSO-LA
With the explosive growth in amount of information, it is highly required to utilize tools and methods in order to search, filter and manage resources. One of the major problems in text classification relates to the high dimensional feature spaces. Therefore, the main goal of text classification is to reduce the dimensionality of features space. There are many feature selection methods. However...
متن کاملAn Empirical Study of Category Skew on Feature Selection for Text Categorization
In this paper, we present an empirical comparison of the effects of category skew on six feature selection methods. The methods were evaluated on 36 datasets generated from the 20 Newsgroups, OHSUMED, and Reuters-21578 text corpora. The datasets were generated to possess particular category skew characteristics (i.e., the number of documents assigned to each category). Our objective was to dete...
متن کاملTwo Step POS Selection for SVM Based Text Categorization
Although many researchers have verified the superiority of Support Vector Machine (SVM) on text categorization tasks, some recent papers have reported much lower performance of SVM based text categorization methods when focusing on all types of parts of speech (POS) as input words and treating large numbers of training documents. This was caused by the overfitting problem that SVM sometimes sel...
متن کاملCategorical Probability Proportion Difference (CPPD): A Feature Selection Method for Sentiment Classification
Sentiment analysis is to extract the opinion of the user from of the text documents. Sentiment classification using machine learning methods face problem of handling huge number of unique terms in a feature vector for the classification. Thus it is required to eliminate the irrelevant and noisy terms from the feature vector. Feature selection methods reduce the feature size by selecting promine...
متن کاملMMR-based Feature Selection for Text Categorization
We introduce a new method of feature selection for text categorization. Our MMR-based feature selection method strives to reduce redundancy between features while maintaining information gain in selecting appropriate features for text categorization. Empirical results show that MMR-based feature selection is more effective than Koller & Sahami’s method, which is one of greedy feature selection ...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2008